Abstract. We present an algorithm that computes the Lempel-Ziv
decomposition in $O(n(\log\sigma + \log\log n))$ time and $n\log\sigma + \varepsilon n$ bits of space,
where $\varepsilon$ is a constant rational parameter, $n$ is the length of the input
string, and $\sigma$ is the alphabet size. The $n\log\sigma$ bits in the space bound
are for the input string itself, which is treated as read-only.


1 Introduction 

The Lempel-Ziv decomposition [?] is a basic technique for data compression and
plays an important role in string processing. It has several modifications used 
in various compression schemes. The decomposition considered in this paper is 
used in LZ77-based compression methods and in several compressed text indexes 
designed to efficiently store and search massive highly-repetitive data sets. 

The standard algorithms computing the Lempel-Ziv decomposition work in
$O(n\log\sigma)$ time and $O(n\log n)$ bits of space, where $n$ is the length of the input
string and $\sigma$ is the alphabet size. It is known that this is the best possible time
for general alphabets [?]. However, for the most important case of an integer
alphabet, there exist algorithms working in $O(n)$ time and $O(n\log n)$ bits (see
[8] for references). When $\sigma$ is small, this number of bits is too big compared to
the $n\log\sigma$ bits of the input string and can be prohibitive. To address this issue,
several algorithms using $O(n\log\sigma)$ bits were designed.

The main contribution of this paper is a new algorithm computing the
Lempel-Ziv decomposition in $O(n(\log\sigma + \log\log n))$ time and $n\log\sigma + \varepsilon n$ bits
of space, where $\varepsilon$ is a constant rational parameter. The $n\log\sigma$ bits in the space
bound are for the input string itself, which is treated as read-only. The following
table lists the time and space required by existing approaches to the Lempel-Ziv
parsing in $O(n\log\sigma)$ bits of space.


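| Time | Bits of space | Note | Author(s) |
|---|---|---|---|
| $O(n\log\sigma)$ | $O(n\log\sigma)$ | | Ohlebusch and Gog [17] |
| $O(n\log^3 n)$ | $n\log\sigma + O(n)$ | online | Okanohara and Sadakane [18] |
| $O(n\log^2 n)$ | $O(n\log\sigma)$ | online | Starikovskaya [20] |
| $O(n\log n)$ | $O(n\log\sigma)$ | online | Yamamoto et al. [?] |
| $O(n\log n\log\log\sigma)$ | $n\log\sigma + \varepsilon n$ | | Kärkkäinen et al. [?] |
| $O(n(\log\sigma + \log\log n))$ | $n\log\sigma + \varepsilon n$ | | this paper |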


By a more careful analysis, one can show that when $\varepsilon$ is not a constant, the
running time of our algorithm is $O(\frac{n}{\varepsilon}(\log\sigma + \log\frac{\log n}{\varepsilon}))$; we omit the details here.

Throughout the paper, $\log$ denotes the logarithm with base 2.
Preliminaries. Let $w$ be a string of length $n$. Denote $|w| = n$. We write
$w[0], w[1], \ldots, w[n-1]$ for the letters of $w$ and $w[i..j]$ for $w[i]w[i+1]\cdots w[j]$. A
string can be reversed to get $\overleftarrow{w} = w[n-1]\cdots w[1]w[0]$, called the reversed $w$. A
string $u$ is a substring (or factor) of $w$ if $u = w[i..j]$ for some $i$ and $j$. The
pair $(i,j)$ is not necessarily unique; we say that $i$ specifies an occurrence of $u$ in
$w$. A string can have many occurrences in another string. For $i,j \in \mathbb{Z}$, the set
$\{k \in \mathbb{Z} : i \le k \le j\}$ is denoted by $[i..j]$; $[i..j)$ denotes $[i..j-1]$.

Throughout the paper, $s$ denotes the input string of length $n$ over the integer
alphabet $[0..\sigma)$. Without loss of generality, we assume that $\sigma \le n$ and $\sigma$ is a power
of two. Thus, $s$ occupies $n\log\sigma$ bits. Simplifying the presentation, we suppose
that $s[0]$ is a special letter that is smaller than any letter in $s[1..n-1]$.

Our model of computation is the unit cost word RAM with the machine word
size at least $\log n$ bits. Denote $r = \log_\sigma n = \frac{\log n}{\log\sigma}$. For simplicity, we assume that
$\log n$ is divisible by $\log\sigma$. Thus, one machine word can contain a string of length
$\le r$; we say that it is a packed string. Any substring of $s$ of length $r$ can be packed
in a machine word in constant time by standard bitwise operations. Therefore,
one can compare any two substrings of $s$ of length $k$ in $O(k/r + 1)$ time.
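The packed-string comparison can be illustrated by a minimal Python sketch. The names `pack` and `equal_substrings` and the parameter `bits` (standing for $\log\sigma$) are hypothetical, and Python integers stand in for machine words, so the packing loop here is not constant-time as it would be on the word RAM.

```python
def pack(s, start, length, bits):
    """Pack s[start:start+length] (letters in [0, 2**bits)) into one integer."""
    word = 0
    for c in s[start:start + length]:
        word = (word << bits) | c
    return word


def equal_substrings(s, i, j, k, r, bits):
    """Compare s[i:i+k] with s[j:j+k], r letters per packed comparison."""
    for off in range(0, k, r):
        m = min(r, k - off)  # the last chunk may be shorter than r
        if pack(s, i + off, m, bits) != pack(s, j + off, m, bits):
            return False
    return True
```

The loop performs about $k/r + 1$ packed comparisons, matching the bound stated above.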

The Lempel-Ziv decomposition of $s$ is the decomposition $s = z_1 z_2 \cdots z_l$ such
that each $z_i$ is either a letter that does not occur in $z_1 z_2 \cdots z_{i-1}$ or the longest
substring that occurs at least twice in $z_1 z_2 \cdots z_i$ (e.g., $s = a \cdot b \cdot b \cdot abbabb \cdot c \cdot ab \cdot ab$).
The substrings $z_1, z_2, \ldots, z_l$ are called the Lempel-Ziv factors. Our algorithm
consecutively reports the factors in the form of pairs $(|z_i|, p_i)$, where $p_i$ is either
the position of a nontrivial occurrence of $z_i$ in $z_1 z_2 \cdots z_i$ (it is called an earlier
occurrence of $z_i$) or $z_i$ itself if $z_i$ is a letter that does not occur in $z_1 z_2 \cdots z_{i-1}$.
The reported pairs are not stored in main memory.
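As a reference point for the definition, the following brute-force Python sketch computes the decomposition in quadratic time or worse. It is not the algorithm of this paper; the function names `lz_factorize` and `factor_strings` are hypothetical and serve only to check small examples.

```python
def lz_factorize(s):
    """Naive Lempel-Ziv decomposition: returns pairs (length, position) or
    (1, letter) for a letter with no earlier occurrence."""
    factors = []
    n = len(s)
    p = 0
    while p < n:
        best_len, best_pos = 0, -1
        for q in range(p):  # candidate earlier occurrence (overlaps allowed)
            l = 0
            while p + l < n and s[q + l] == s[p + l]:
                l += 1
            if l > best_len:
                best_len, best_pos = l, q
        if best_len == 0:
            factors.append((1, s[p]))
            p += 1
        else:
            factors.append((best_len, best_pos))
            p += best_len
    return factors


def factor_strings(s):
    """Return the factors themselves, e.g. to compare with a hand example."""
    out, p = [], 0
    for length, _ in lz_factorize(s):
        out.append(s[p:p + length])
        p += length
    return out
```

On the example string above, `factor_strings("abbabbabbcabab")` yields the factors `a, b, b, abbabb, c, ab, ab`.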

Fix a rational constant $\varepsilon > 0$. It suffices to prove that our algorithm works
in $O(n(\log\sigma + \log\log n))$ time and $n\log\sigma + O(\varepsilon n)$ bits: the substitution $\varepsilon' = c\varepsilon$,
where $c$ is the constant under the big-O, gives the required $n\log\sigma + \varepsilon' n$ bits with
the same working time. We use different approaches to process the Lempel-Ziv
factors of different lengths. In Section 2 we show how to process “short” factors
of length $< r/2$. In Section 3 we describe new compact data structures that allow
us to find all “medium” factors of length $< (\log n/\varepsilon)^2$. In Section 4 we apply the
clever technique of [6] for the analysis of all other “long” factors.

2 Short Factors 

In this section we consider the Lempel-Ziv factors of length $< r/2$, so we assume
$r > 2$. Suppose the algorithm has reported the factors $z_1, z_2, \ldots, z_{k-1}$ and now
we process $z_k$. Denote $p = |z_1 z_2 \cdots z_{k-1}|$. We maintain arrays $H_1, H_2, \ldots, H_{\lceil r/2 \rceil}$
defined as follows: for $i \in [1..\lceil r/2 \rceil]$, the array $H_i$ contains $\sigma^i$ integers such that for
any $x \in [0..\sigma^i)$, either $H_i[x]$ equals the position from $[0..p)$ of an occurrence in $s$
of the packed string $x$ of length $i$ or $H_i[x] = -1$ if there are no such positions.

For each $i \in [1..r]$ and $j \in [0..n]$, denote by $x_i^j$ the packed string $s[j..j+i-1]$.
We have $H_1[x_1^p] = -1$ iff $s[p]$ is a letter that does not appear in $s[0..p-1]$; in this
case the algorithm reports $z_k$ immediately. Further, we have $H_{\lceil r/2 \rceil}[x_{\lceil r/2 \rceil}^p] \ne -1$
iff $|z_k| \ge r/2$; this case is considered in Sections 3 and 4. Suppose $H_1[x_1^p] \ne -1$ and
$H_{\lceil r/2 \rceil}[x_{\lceil r/2 \rceil}^p] = -1$. Our algorithm finds the minimal $q \in [2..\lceil r/2 \rceil]$ such that
$H_q[x_q^p] = -1$. Then we obviously have $|z_k| = q-1$ and $H_{q-1}[x_{q-1}^p]$ is the position
of an earlier occurrence of $z_k$. Clearly, the algorithm works in $O(|z_k|)$ time.

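The inequality $r = \log n/\log\sigma > 2$ implies $\sigma < \sqrt{n}$. Thus, $H_1, H_2, \ldots, H_{\lceil r/2 \rceil}$
altogether occupy at most $\sigma^{\lceil r/2 \rceil} r\log n \le \sigma^{1/2}\sigma^{r/2} r\log n \le n^{3/4} r\log n = o(n)$ bits.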

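To maintain $H_1, \ldots, H_{\lceil r/2 \rceil}$, we consecutively examine the positions $j =
0, 1, \ldots, p-1$ and for those positions for which $H_{\lceil r/2 \rceil}[x_{\lceil r/2 \rceil}^j] = -1$, we perform
the assignments $H_1[x_1^j] \leftarrow j$, $H_2[x_2^j] \leftarrow j$, $\ldots$, $H_{\lceil r/2 \rceil}[x_{\lceil r/2 \rceil}^j] \leftarrow j$. Hence,
we execute these assignments for at most $\sigma^{\lceil r/2 \rceil}$ positions and the overall time
required for the maintenance of $H_1, \ldots, H_{\lceil r/2 \rceil}$ is $O(n + r\sigma^{\lceil r/2 \rceil}) = O(n)$.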
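The procedure of this section can be sketched in Python as follows. This is a simplified illustration under stated assumptions: dictionaries keyed by substrings replace the arrays indexed by packed strings, `h` plays the role of $\lceil r/2 \rceil$, and the names `register_positions` and `short_factor` are hypothetical. Windows truncated by the end of the string are handled explicitly, a detail the text does not discuss.

```python
def register_positions(s, H, h, start, stop):
    """Process positions start..stop-1: record each in H[1..h] when the
    length-h window starting there has not been seen before."""
    for j in range(start, stop):
        window = s[j:j + h]
        if len(window) == h and window in H[h]:
            continue  # every prefix of this window is already registered
        for i in range(1, len(window) + 1):
            H[i][s[j:j + i]] = j


def short_factor(s, H, h, p):
    """Return (1, letter) for a new letter, (length, earlier position) for a
    factor shorter than h, or None when the factor at p has length >= h."""
    if s[p:p + 1] not in H[1]:
        return (1, s[p])
    if len(s[p:p + h]) == h and s[p:p + h] in H[h]:
        return None  # left to the medium/long factor procedures
    q = 2
    while p + q <= len(s) and s[p:p + q] in H[q]:
        q += 1
    return (q - 1, H[q - 1][s[p:p + q - 1]])
```

A caller would alternate the two functions: report the factor at `p`, then register the positions it covers before moving on.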

3 Medium Factors 

Suppose the algorithm has reported the Lempel-Ziv factors $z_1, z_2, \ldots, z_{k-1}$ and
already decided that $|z_k| \ge r/2$ applying the procedure of Section 2. Denote $p =
|z_1 z_2 \cdots z_{k-1}|$, $\tau = \frac{\log n}{\varepsilon}$, and $b = \lceil \varepsilon n/(\log\sigma + \log\log n) \rceil$. We assume $p+b+\tau^2 < n$;
the case $p+b+\tau^2 \ge n$ is analogous. Our algorithm processes $s[0..p+b]$ and
reports not only $z_k$ but also all Lempel-Ziv factors starting in positions $[p..p+b]$.

The algorithm consists of three phases: the first one builds for other phases an
indexing data structure on the string $s[p..p+b]$ in $O(b\log\sigma)$ time and $O(b(\log\sigma +
\log\log n)) = O(\varepsilon n)$ bits; the second phase scans $s[0..p+b]$ in $O(n)$ time and fills a
bit array $lz[0..b]$ so that for any $i \in [0..b]$, $lz[i] = 1$ iff there is a Lempel-Ziv factor
starting in the position $p+i$; finally, the last phase scans $s[0..p+b]$ in $O(n)$ time
and reports earlier occurrences of the found Lempel-Ziv factors. Thus, the overall
time required by this algorithm is $O((n + b\log\sigma)\frac{n}{b}) = O(n(\log\sigma + \log\log n))$.
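The overall bound can be checked by a short calculation. The sketch below assumes that the three phases are repeated for about $n/b$ consecutive blocks of $b$ positions, each round costing $O(n + b\log\sigma)$ time, and that $\varepsilon$ is a constant.

```latex
\begin{align*}
(n + b\log\sigma)\cdot\frac{n}{b}
  &= \frac{n^2}{b} + n\log\sigma \\
  &\le \frac{n(\log\sigma + \log\log n)}{\varepsilon} + n\log\sigma
     && \text{since } b \ge \frac{\varepsilon n}{\log\sigma + \log\log n} \\
  &= O\bigl(n(\log\sigma + \log\log n)\bigr)
     && \text{for constant } \varepsilon .
\end{align*}
```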

The data structures we use can search only the Lempel-Ziv factors of length
$< \tau^2$; we delegate the longer factors to the procedure of Section 4. This restriction
allows us to make our structures fast and compact. More precisely, our algorithm
consecutively computes the lengths of the Lempel-Ziv factors starting in $[p..p+b]$
and once we have found a factor of length $\ge \tau^2$, we invoke the procedure of
Section 4 to compute the length and an earlier occurrence of this factor.

3.1 Main Tools 

Let $x$ be a string of length $d+1$. Denote $\overleftarrow{x}_i = \overleftarrow{x[0..i]}$. The suffix array of $\overleftarrow{x}$
is the permutation $SA[0..d]$ of the integers $[0..d]$ such that $\overleftarrow{x}_{SA[0]} < \overleftarrow{x}_{SA[1]} <
\ldots < \overleftarrow{x}_{SA[d]}$ in the lexicographical order. The Burrows-Wheeler transform [7]
of $\overleftarrow{x}$ is the string $BWT[0..d]$ such that $BWT[i] = x[SA[i]+1]$ if $SA[i] < d$ and
$BWT[i] = x[0]$ otherwise. We equip $BWT$ with the function $\Psi$ defined as follows:
$\Psi(i) = SA^{-1}[SA[i]+1]$ if $SA[i] < d$ and $\Psi(i) = 0$ otherwise.

Lemma 1 (see [?]). The string $BWT$ and the function $\Psi$ for a string $x$ of
length $d+1$ over the alphabet $[0..\sigma)$ can be constructed in $O(d\log\log\sigma)$ time and
$O(d\log\sigma)$ bits of space; $\Psi$ is encoded in $O(d\log\sigma)$ bits with $O(1)$ access time.
Example 1. Consider the string $x = \$aabadcaababadcaaba$.

| $x[0..SA[i]]$ | $BWT[i]$ | $SA[i]$ | $\Psi(i)$ | $i$ |
|---|---|---|---|---|
| \$ | a | 0 | 1 | 0 |
| \$a | a | 1 | 2 | 1 |
| \$aa | b | 2 | 11 | 2 |
| \$aabadcaa | b | 8 | 12 | 3 |
| \$aabadcaababadcaa | b | 16 | 13 | 4 |
| \$aaba | d | 4 | 17 | 5 |
| \$aabadcaaba | b | 10 | 14 | 6 |
| \$aabadcaababadcaaba | \$ | 18 | 0 | 7 |
| \$aabadcaababa | d | 12 | 18 | 8 |
| \$aabadca | a | 7 | 3 | 9 |
| \$aabadcaababadca | a | 15 | 4 | 10 |
| \$aab | a | 3 | 5 | 11 |
| \$aabadcaab | a | 9 | 6 | 12 |
| \$aabadcaababadcaab | a | 17 | 7 | 13 |
| \$aabadcaabab | a | 11 | 8 | 14 |
| \$aabadc | a | 6 | 9 | 15 |
| \$aabadcaababadc | a | 14 | 10 | 16 |
| \$aabad | c | 5 | 15 | 17 |
| \$aabadcaababad | c | 13 | 16 | 18 |

In the dynamic weighted ancestor (WA for short) problem one has 1) a
weighted tree, where the weight of each vertex is greater than the weight of
its parent, 2) the queries finding for a vertex $v$ and number $i$ the ancestor of $v$ with
the minimal weight $\ge i$, 3) the updates inserting new vertices. Let $v$ be a vertex
of a trie $T$ ($v \in T$ for short). Denote by $lab(v)$ the string written on the path
from the root to $v$. We treat tries as weighted trees: $|lab(v)|$ is the weight of $v$.
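To make the query concrete, here is a naive Python sketch that answers a weighted ancestor query by walking up the tree. It takes time proportional to the depth rather than the bounds of the lemmas below, it assumes that a vertex counts as its own ancestor, and the name `weighted_ancestor` and the dictionary representation are hypothetical.

```python
def weighted_ancestor(parent, weight, v, i):
    """Return the ancestor of v (v itself included) with the minimal
    weight >= i, or None if even v is lighter than i."""
    if weight[v] < i:
        return None
    # Weights strictly increase from the root down, so climb while the
    # parent still satisfies the threshold.
    while parent[v] is not None and weight[parent[v]] >= i:
        v = parent[v]
    return v
```

For a trie, `weight[v]` would be $|lab(v)|$, so the query finds the shallowest vertex on the path to `v` whose label has length at least `i`.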

Lemma 2 (see [13]). For a weighted tree with at most $k$ vertices, the dynamic
WA problem can be solved in $O(k\log k)$ bits of space with queries and updates
working in $O(\log k)$ amortized time.

One can easily modify the proof of [13] for a special case of this problem when
the weights are integers $[0..\tau^2]$ and the height of the tree is bounded by $\tau^2$.

Lemma 3. Let $T$ be a weighted tree with at most $m \le n$ vertices, the weights
$[0..\tau^2]$, and the height $\le \tau^2$. The dynamic WA problem for $T$ can be solved in
$O(m(\log m + \log\log n))$ bits of space with queries and updates working in $O(1)$
amortized time using a shared table of size $o(n)$ bits.

Proof. In [13], using $O(m\log m)$ additional bits of space, the general problem for
a tree with $m$ vertices, the weights $[0..\tau^2]$, and the height $\le \tau^2$ is reduced to the
same problem for subtrees with at most $\log\log m$ vertices and the problem of
the maintenance of a set of dynamic predecessor data structures on the weights
$[0..\tau^2]$ so that each of these predecessor structures contains at most $\tau^2$ weights
and altogether they contain $O(m)$ weights in total. Each query or update on the tree
requires a constant number of queries/updates on the subtrees of size $\le \log\log m$
and on the predecessor structures.

Since the weights are bounded by $\tau^2$, a subtree with at most $\log\log m$
vertices fits in $O(\log\log m \log\tau) = O((\log\log n)^2)$ bits. So, we can perform
queries and updates on these trees in $O(1)$ time using a shared table of size
$O(2^{O((\log\log n)^2)}\log^{O(1)} n) = o(n)$ bits. Further, one can organize a dynamic
predecessor data structure with at most $\tau^2$ elements as a B-tree of a constant
depth with $O(\sqrt{\tau})$-element predecessor structures on each level. Any
predecessor structure with $O(\sqrt{\tau})$ weights fits in $O(\sqrt{\tau}\log\log n)$ bits and therefore, one
can perform all operations on these small structures with the aid of a shared
table of size $O(2^{O(\sqrt{\tau}\log\log n)}\log^{O(1)} n) = o(n)$ bits. Thus we can perform all
operations on the source predecessor structure in $O(1)$ time. □

Denote by $lcp(t_1, t_2)$ the length of the longest common prefix of the strings
$t_1$ and $t_2$. Denote $rlcp(i,j) = \min\{\tau^2, lcp(\overleftarrow{x}_{SA[i]}, \overleftarrow{x}_{SA[j]})\}$.

Lemma 4 (see [2]). For a string $x$ of length $d+1$, using $BWT$ of $\overleftarrow{x}$, one can
compute an array $rlcp[0..d-1]$ such that $rlcp[i] = rlcp(i,i+1)$, for $i \in [0..d)$, in
$O(d\log\sigma)$ time and $O(d\log\sigma)$ bits; the array occupies $O(d\log\log n)$ bits.
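The following naive Python sketch illustrates the truncated lcp values, under the assumption that $rlcp(i,j)$ is the lcp of the reversed prefixes of ranks $i$ and $j$ capped at $\tau^2$ (passed as the hypothetical parameter `cap`). It also lets one check the range-minimum identity $rlcp(j_1,j_2) = \min\{rlcp(j_1,j_2-1), rlcp(j_2-1,j_2)\}$ on small inputs. It does not implement the construction of Lemma 4.

```python
def sorted_reversed_prefixes(x):
    """Reversed prefixes of x in lexicographical order."""
    return sorted(x[:i + 1][::-1] for i in range(len(x)))


def rlcp(rev, i, j, cap):
    """Truncated lcp of the i-th and j-th reversed prefixes in sorted order."""
    a, b = rev[i], rev[j]
    l = 0
    while l < min(len(a), len(b), cap) and a[l] == b[l]:
        l += 1
    return l
```

With the string of Example 1 and `cap = 4`, ranks 1 and 2 share one letter, while ranks 3 and 4 share seven letters and are truncated to 4.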

3.2 Indexing Data Structure 

Trie. Denote $d = 1+b+\tau^2$. The algorithm creates a string $x$ of length $d+1$ and
copies the string $s[p..p+b+\tau^2]$ in $x[1..d]$; $x[0]$ is set to a special letter less than
any letter in $x[1..d]$. Let $SA$ be the suffix array of $\overleftarrow{x}$ (we use it only conceptually).
Denote $\overleftarrow{x'}_i = \overleftarrow{x[i-\tau^2+1..i]}$ (we assume that $x[-1], x[-2], \ldots$ are equal to $x[0]$).
Here we discuss the design of our indexing data structure, an augmented compact
trie of the strings $\overleftarrow{x'}_0, \overleftarrow{x'}_1, \ldots, \overleftarrow{x'}_d$ carefully packed in $O(d(\log\sigma + \log\log n))$ bits.

For simplicity, suppose $d$ is a multiple of $r$. The skeleton of our structure is a
compact trie $Q_0$ of the strings $\{\overleftarrow{x'}_{SA[jr]} : j \in [0..d/r]\}$. We augment $Q_0$ with the
WA structure of Lemma 3. Each vertex $v \in Q_0$ contains the following fields: 1)
the pointer to the parent of $v$ (if any); 2) the pointers to the children of $v$ in the
lexicographical order; 3) the length of $lab(v)$; 4) the length of the string written
on the edge connecting $v$ to its parent (if any).

Notice that the fields 3)-4) fit in $O(\log\log n)$ bits. Clearly, $Q_0$ occupies
$O((d/r)\log n) = O(d\log\sigma)$ bits of space. The pointers to the substrings of $x$
written on the edges of $Q_0$ are not stored, so one cannot use $Q_0$ for searching.

We create an array $L[0..d/r]$ such that for $i \in [0..d/r]$, $L[i]$ is the pointer to
the leaf of $Q_0$ corresponding to $\overleftarrow{x'}_{SA[ir]}$. Now we build a compact trie $Q$ inserting
the strings $\overleftarrow{x'}_{SA[ir+1]}, \ldots, \overleftarrow{x'}_{SA[ir+r-1]}$ in $Q_0$ for each $i \in [0..d/r)$ as follows. For a fixed $i$,
these strings add to $Q_0$ trees $T_1, \ldots, T_l$ attached to the branches $\overleftarrow{x'}_{SA[ir]}$ and
$\overleftarrow{x'}_{SA[(i+1)r]}$ of $Q_0$ (see Fig. 1). We store $T_1, \ldots, T_l$ in a contiguous memory block
$F_i$. The pointer to $F_i$ is stored in the leaf of $Q_0$ corresponding to $\overleftarrow{x'}_{SA[ir]}$, so
one can find $F_i$ in $O(1)$ time using $L$. Since $T_1, \ldots, T_l$ have at most $2r$ vertices
in total, $O(\log\log n)$ bits per vertex suffice for the fields 1)-4). Now we discuss
how $T_1, \ldots, T_l$ are attached to $Q_0$. Consider $v \in Q_0$ and the vertices $v_1, \ldots, v_h$
splitting the edge connecting $v$ to its parent in $Q_0$. Let $T_{i_1}, \ldots, T_{i_g}$ be the trees
that must be attached to $v, v_1, \ldots, v_h$ (see Fig. 1). We add to $F_i$ a memory block
$N_v$ containing the WA structure of Lemma 3 for the chain $v, v_1, \ldots, v_h$ with the
weights $|lab(v)|, |lab(v_1)|, \ldots, |lab(v_h)|$. Each of the vertices $v, v_1, \ldots, v_h$ in this
chain contains the $O(\log\log n)$-bit pointers (inside $F_i$) to the roots of $T_{i_1}, \ldots, T_{i_g}$
attached to this vertex. Hence, $N_v$ occupies $O((h+g)\log\log n)$ bits. One can
find the children for each of the vertices $v, v_1, \ldots, v_h$ in $O(1)$ time using $Q_0$ and
the chain in the block $N_v$. Further, one can find, for any $j \in [1..g]$, the parent of
the root of $T_{i_j}$ in $O(1)$ time by a WA query on $Q_0$ to find a suitable $v$ and a WA
query on the chain in $N_v$. Finally, we augment each $T_j$ with the WA structure
of Lemma 3. Thus, by Lemma 3, $T_1, \ldots, T_l$ add at most $O(r\log\log n)$ bits to $Q$.

Fig. 1. Solid vertices and edges are from $Q_0$.

For each $i \in [0..d/r)$, we augment the leaf referred by $L[i]$ with an array
$L_i[0..r-2]$ such that for $j \in [0..r-2]$, $L_i[j]$ is the $O(\log\log n)$-bit pointer (inside
$F_i$) to the leaf of $Q$ corresponding to $\overleftarrow{x'}_{SA[ir+j+1]}$. So, for any $j \in [0..d]$, one can
easily find the leaf of $Q$ corresponding to $\overleftarrow{x'}_{SA[j]}$ in $O(1)$ time via $L$ and $L_{\lfloor j/r \rfloor}$.

Finally, the whole described structure $Q$ occupies $O(d(\log\sigma + \log\log n))$ bits.

Prefix links. Consider $v \in Q$. Denote by $[i_v..j_v]$ the longest segment such that
for each $i \in [i_v..j_v]$, $\overleftarrow{x'}_{SA[i]}$ starts with $lab(v)$ (see Fig. 2). Let $BWT$ be the
Burrows-Wheeler transform of $\overleftarrow{x}$. Denote the set of the letters of $BWT[i_v..j_v]$
by $P_v$. We associate with $v$ the prefix links mapping each $c \in P_v$ to an integer
$p_v(c) \in [i_v..j_v]$ such that $x[SA[p_v(c)]+1] = c$ (there might be many such $p_v(c)$;
we choose any). The prefix links correspond to the well-known Weiner links.
Hence, $Q$ has at most $O(d)$ prefix links. Observe that $P_u \supseteq P_v$ for any ancestor
$u$ of $v$. The problem is to store the prefix links in $O(d(\log\sigma + \log\log n))$ bits.

Fix $i \in [0..d/r)$. Denote by $V_i$ the set of the vertices $v \notin Q_0$ such that $v$ does
not have descendants from $Q_0$ and lies between the branches $\overleftarrow{x'}_{SA[ir]}$ and $\overleftarrow{x'}_{SA[(i+1)r]}$.
We associate with each $v \in V_i$ a dictionary $D_v$ mapping each $c \in P_v$ to $p_v(c)-ir$
and store all $D_v$, for $v \in V_i$, in a contiguous memory block $H_i$. Since $|V_i| < r$
and $P_v$ is a subset of the letters of $BWT[ir..(i+1)r]$, we have $p_v(c)-ir \in [1..r)$ and all $D_v$, for
$v \in V_i$, occupy overall $O(\sum_{v \in V_i} |P_v|(\log\sigma + \log\log n)) = O(r^2(\log\sigma + \log\log n))$
bits of space. Therefore, we can store in each $v \in V_i$ the $O(\log\log n)$-bit pointer
to $D_v$ (inside $H_i$). The pointer to $H_i$ itself is stored in the leaf referred by $L[i]$.
Fig. 2. $r = 4$, the prefix links associated with vertices are in squares.


Consider $v \notin Q_0$ such that $v$ lies on an edge connecting a vertex $w \in Q_0$
to its parent in $Q_0$. Let $\overleftarrow{x'}_{SA[j_1 r]}$ and $\overleftarrow{x'}_{SA[j_2 r]}$ be the strings corresponding to
the leftmost and rightmost descendant leaves of $w$ contained in $Q_0$. We split $P_v$
into three subsets: $P_1 = \{c \in P_v : p_v(c) < j_1 r\}$, $P_2 = \{c \in P_v : p_v(c) > j_2 r\}$,
$P_3 = P_v \setminus (P_1 \cup P_2)$. Clearly $P_3 \subseteq P_w \subseteq P_v$. Hence, we can use $P_w$ instead of $P_3$
and store only the sets $P_1$ and $P_2$ in a way similar to that discussed above.

Suppose $v \in Q_0$. Let for $c \in P_v$, $j_c \in [i_v..j_v]$ be the position of the first
occurrence of $c$ in $BWT[i_v..j_v]$. Clearly, we can set $p_v(c) = j_c$. We add to $v$ a
dictionary mapping each $c \in P_v$ to $h_c = |\{c' \in P_v : j_{c'} < j_c\}|$. Denote $q = |P_v|$.
Since $q \le \sigma$, the dictionary occupies $O(q\log\sigma)$ bits. Now it suffices to map $h_c$
to $j_c$. Let $j'_0, \ldots, j'_{q-1}$ denote all $j_c$, for $c \in P_v$, in increasing order. Obviously
$j'_{h_c} = j_c$. The idea is to sample each $(\tau^2\log n)$th position in $BWT$. We add to
$v$ a bit array $A_v[0..q-1]$ indicating the sampled $j'_0, \ldots, j'_{q-1}$: $A_v[0] = 1$ and for
$h \in [1..q)$, $A_v[h] = 1$ iff $j'_{h-1} < l\tau^2\log n \le j'_h$ for an integer $l$. $A_v$ is equipped with
the structure of [?] supporting the queries $r_{A_v}(h) = \sum_{i=0}^{h} A_v[i]$ in $O(1)$ time and
$o(q)$ additional bits. The sampled sequence $\{j'_h : A_v[h] = 1\}$ is stored in an array
$B_v$. Finally, we add an array $C_v[0..q-1]$ such that $C_v[h] = j'_h - B_v[r_{A_v}(h)-1]$.
Now we map $h$ to $j'_h$ as follows: $j'_h = B_v[r_{A_v}(h)-1] + C_v[h]$. Clearly, each value
of $C_v$ is in the range $[0..\tau^2\log n]$ and hence, $C_v$ occupies $O(q\log(\tau^2\log n)) =
O(q\log\log n)$ bits. It suffices to estimate the space consumed by $B_v$. Since the
number of the vertices in $Q_0$ is $O(d/r)$ and the height of $Q$ is at most $\tau^2$, all $B_v$
arrays occupy at most $O((d/r)\log n + \tau^2\frac{d}{\tau^2\log n}\log n) = O(d\log\sigma)$ bits in total.
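The sampling scheme can be sketched in Python as follows. The sketch assumes that the rank query counts the ones in $A_v[0..h]$ inclusive; a plain sum stands in for the constant-time rank structure, and the names `build_sampled`, `lookup`, and the parameter `step` (playing the role of $\tau^2\log n$) are hypothetical.

```python
def build_sampled(js, step):
    """js: increasing positions j'_0..j'_{q-1}. Returns the bit array A,
    the sampled positions B, and the small offsets C."""
    A, B, C = [], [], []
    for h, j in enumerate(js):
        # Sampled iff some multiple of step lies in (js[h-1], j].
        sampled = h == 0 or js[h - 1] // step < j // step
        A.append(1 if sampled else 0)
        if sampled:
            B.append(j)
        C.append(j - B[-1])  # distance to the last sampled position
    return A, B, C


def lookup(A, B, C, h):
    """Recover j'_h from the three arrays."""
    rank = sum(A[:h + 1])  # stand-in for an O(1)-time rank query
    return B[rank - 1] + C[h]
```

Every offset in `C` stays below `step`, which is why the offsets need only about $\log(\tau^2\log n)$ bits each.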

Construction of Q. Initially, $Q$ contains one leaf corresponding to $\overleftarrow{x'}_{SA[0]}$. We
consecutively insert $\overleftarrow{x'}_{SA[1]}, \ldots, \overleftarrow{x'}_{SA[d]}$ in $Q$ in groups of $r$ elements. During the
construction, we maintain on $Q$ a set of the dynamic WA structures of Lemma 3
in such a way that one can answer any WA query on $Q$ in $O(1)$ time.

Suppose we have inserted $\overleftarrow{x'}_{SA[0]}, \ldots, \overleftarrow{x'}_{SA[ir]}$ in $Q$ and now we are to insert
$\overleftarrow{x'}_{SA[ir+1]}, \ldots, \overleftarrow{x'}_{SA[(i+1)r]}$. We allocate the memory block $F_i$ required for new
vertices. Using Lemma 4, we compute $rlcp(j-1, j)$ for all $j \in [ir+1..(i+1)r]$.
Since $rlcp(j_1, j_2) = \min\{rlcp(j_1, j_2-1), rlcp(j_2-1, j_2)\}$, the algorithm can
compute $rlcp(ir, ir+j)$ for all $j \in [1..r]$ in $O(r)$ time. Using the WA query on the
leaf $\overleftarrow{x'}_{SA[ir]}$ and the value $rlcp(ir, (i+1)r)$, we find the position where we insert
a new leaf $\overleftarrow{x'}_{SA[(i+1)r]}$. Similarly, using the WA queries, we consecutively insert
$\overleftarrow{x'}_{SA[ir+j]}$ for $j = 1, 2, \ldots$ as long as $rlcp(ir, ir+j) > rlcp(ir, (i+1)r)$ and then all
other $\overleftarrow{x'}_{SA[(i+1)r-j]}$ for $j = 1, 2, \ldots$ (Fig. 1). All related WA structures, the arrays
$L$, $L_i$, the pointers, and the fields for the vertices are built in an obvious way.

One can construct the prefix links of a vertex from those of its children in
$O(q\log\sigma)$ time, where $q$ is the number of the links in the children. As there are
at most $O(d)$ prefix links, one DFS traversal of $Q$ builds them in $O(d\log\sigma)$ time.

Finally, using the result of [?], the algorithm converts in $O(d\log\sigma)$ time all
dictionaries in the prefix links of the resulting trie $Q$ into perfect hashes with
$O(1)$ access time. So, one can access any prefix link in $O(1)$ time.

3.3 Algorithm for Medium Factors 

In the dynamic marked descendant problem one has a tree, a set of marked 
vertices, the queries asking whether there is a marked descendant of a given 
vertex, and the updates marking a given vertex. We assume that each vertex is 
a descendant of itself. We solve this problem on Q as follows. 

Lemma 5. In $O(d(\log\sigma + \log\log n))$ bits one can solve the dynamic marked
descendant problem on $Q$ so that any $k$ queries and updates take $O(k + d)$ time.

Proof. Let $q$ be the number of the vertices in $Q$. Obviously $q = O(d)$. We
perform a DFS traversal of $Q$ in the lexicographical order and assign the indices
$0, 1, \ldots, q-1$ to the vertices of $Q$ in the order of their appearance in the traversal.
Denote by $idx(v)$ the index of a vertex $v$. We add to our structure a bit array
$M[0..q-1]$ initially filled with zeros. A vertex $v$ is marked iff $M[idx(v)] = 1$. It is
easy to see that the indices of the descendants of $v$ form a contiguous segment
$[idx(v)..j]$ for some $j \ge idx(v)$. So, the problem is to find for each vertex the
segment of the descendant indices and then test whether there is an index $k$ in
this segment such that $M[k] = 1$.

For each $v \in Q_0$, we store $idx(v)$ and the segment of the descendant indices
explicitly using $O(\log n)$ bits. Consider a vertex $v \notin Q_0$. Let the leftmost
descendant leaf of $v$ correspond to a string $\overleftarrow{x'}_{SA[j]}$, where $j = ir-k$ for some $i \in [0..d/r]$
and $k \in [0..r)$. Denote by $u$ the leaf corresponding to $\overleftarrow{x'}_{SA[ir]}$. Since there are at
most $2r$ vertices inserted between the leaves corresponding to $\overleftarrow{x'}_{SA[(i-1)r]}$ and
$\overleftarrow{x'}_{SA[ir]}$ and the height of $Q$ is at most $\tau^2$, we have $0 < idx(u) - idx(v) \le 2r+\tau^2$.
So, we store in $v$ the value $idx(u) - idx(v)$ using $O(\log\log n)$ bits. Obviously, one
can compute $idx(v)$ in $O(1)$ time using $idx(u)$ stored explicitly. The structure
occupies $O((d/r)\log n + d\log\log n) = O(d(\log\sigma + \log\log n))$ bits.

Now it is sufficient to describe how to answer the queries on the segments of
the dynamic bit array $M$. We can answer the queries on the segments of length
$\le \frac{\log n}{2}$ using a shared table occupying $O(2^{\log n/2}\log^2 n) = o(n)$ bits. So, the
problem is reduced to the queries on the segments of the form $[i\log n..j\log n)$.
We build a perfect binary tree $T$ with leaves corresponding to the segments
$[i\log n..(i+1)\log n)$ for $i \in [0..q/\log n)$ (without loss of generality, we assume
that $q$ is a multiple of $\log n$ and $q/\log n$ is a power of 2). Each internal
vertex $v$ of $T$ naturally corresponds to a segment $[i2^j\log n..(i+1)2^j\log n)$ for some
$i$ and $j > 0$. Denote $c = i2^j + 2^{j-1}$. We associate with $v$ bit arrays $D_v$ and
$E_v$ of lengths $2^{j-1}$ such that for any $k \in [1..2^{j-1}]$, $D_v[k-1] = 1$ iff there are
ones in the segment $M[(c-k)\log n..c\log n-1]$ and, similarly, $E_v[k-1] = 1$ iff
there are ones in $M[c\log n..(c+k)\log n-1]$. We construct on $T$ the least
common ancestor structure (in the case of the perfect binary tree with $O(q/\log n)$
vertices, this can be simply done in $O(q)$ bits). Then, to answer the query on
a segment $[i\log n..j\log n)$, we first find in $O(1)$ time the least common
ancestor $v$ of the leaves of $T$ corresponding to the segments $[i\log n..(i+1)\log n)$ and
$[(j-1)\log n..j\log n)$ and then test appropriate bits of $D_v$ and $E_v$. All this takes $O(1)$
time. The structure occupies $O(\frac{q}{\log n}\log\frac{q}{\log n}) = O(d)$ bits.

When we set $M[i] = 1$ for some $i \in [0..q)$, the modifications are
straightforward: if the segment $[\lfloor i/\log n \rfloor\log n..\lfloor i/\log n \rfloor\log n + \log n)$ already has ones, then
we are done; otherwise, for each ancestor $v$ of the leaf of $T$ corresponding to
this segment, we scan the array $D_v$ [$E_v$] from left to right [right
to left] from the appropriate position and flip all zero bits. Since there are only
$O(d)$ bits in the structure, the height of $T$ is $O(\log q) = O(\log n)$, and the updates
are initiated at most $q/\log n$ times, $k$ updates run in $O(d + (q/\log n)\log n + k) =
O(d+k)$ time. □
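The core idea of the proof, DFS indices that turn "descendants of $v$" into a contiguous segment of a bit array, can be sketched in Python as follows. This is a simplified illustration: a linear scan of the segment replaces the shared table and the tree of $D_v$, $E_v$ arrays, so queries are not constant-time, and the class name `MarkedDescendants` and the `children` dictionary are hypothetical.

```python
class MarkedDescendants:
    def __init__(self, children, root=0):
        """children maps a vertex to the ordered list of its children."""
        self.first, self.last = {}, {}
        counter = 0
        stack = [(root, False)]
        while stack:  # iterative preorder DFS
            v, done = stack.pop()
            if done:
                self.last[v] = counter - 1  # last index inside v's subtree
                continue
            self.first[v] = counter  # idx(v)
            counter += 1
            stack.append((v, True))
            for c in reversed(children.get(v, [])):
                stack.append((c, False))
        self.M = [0] * counter

    def mark(self, v):
        self.M[self.first[v]] = 1

    def has_marked_descendant(self, v):
        """Each vertex is treated as a descendant of itself."""
        return any(self.M[self.first[v]:self.last[v] + 1])
```

For example, after marking a leaf, every vertex on the path from the root to that leaf answers `True`, and all other vertices answer `False`.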

Filling lz. Denote $\overleftarrow{s}_i = \overleftarrow{s[0..i]}$. Let for $i \in [0..p+d)$, $t_i$ denote the longest prefix
of $\overleftarrow{s}_i$ present in $Q$. We add to each $v \in Q$ an $O(\log\log n)$-bit field $v.mlen$
initialized to $\tau^2$. Also, we use an integer variable $f$ that initially equals 0.

The algorithm increases $f$ computing $|t_f|$ in each step and augments $Q$ as
follows. Suppose $v \in Q$ is such that $t_{f-1}$ is a prefix of $lab(v)$ and other vertices
with this property are descendants of $v$. We say that $v$ corresponds to $t_{f-1}$. We
are to find the vertex of $Q$ corresponding to $t_f$. Suppose $p_v(s[f])$ is defined.
By Lemma 1, one can compute $i = \Psi(p_v(s[f]))$ in $O(1)$ time. Obviously, $\overleftarrow{x'}_{SA[i]}$
starts with $s[f]t_{f-1}$. We obtain the leaf corresponding to $\overleftarrow{x'}_{SA[i]}$ in $O(1)$ time via
$L$ and $L_{\lfloor i/r \rfloor}$ and then find $w \in Q$ corresponding to $t_f$ by the WA query on the
obtained leaf and the number $\min\{\tau^2, |t_{f-1}|+1\}$. Suppose $p_v(s[f])$ is undefined.
If $v$ is the root of $Q$, then we have $|t_f| = 0$. Otherwise, we recursively process
the parent $u$ of $v$ in the same way as $v$ assuming $t_{f-1} = lab(u)$. Finally, once
we have found $w \in Q$ corresponding to $t_f$, we mark the parent of $w$ using the
structure of Lemma 5 and assign $w.mlen \leftarrow \min\{w.mlen, |lab(w)| - |t_f|\}$.

Let $i \in [p..f+1]$ be such that $|s[i..f+1]| \le \tau^2$. Suppose all positions $[0..f]$ are
processed as described above. It is easy to verify that the string $s[i..f+1]$ has
an occurrence in $s[0..f]$ iff either the vertex $v \in Q$ corresponding to $\overleftarrow{s[i..f+1]}$
has a marked descendant or the parent of $v$ is marked and $|lab(v)| - v.mlen \ge
|s[i..f+1]|$. Based on this observation, the algorithm computes $lz$ as follows.

1: for ($t \leftarrow p$; $t \le p + b$; $t \leftarrow t + \max\{1, z\}$) do
2:     for ($z \leftarrow 0$, $v \leftarrow$ the root of $Q$; true; $v \leftarrow w$, $z \leftarrow z + 1$) do
3:         increase $f$ processing $Q$ accordingly until $f = t + z - 1$
4:         if $z \ge \tau^2$ then invoke the procedure of Section 4 to find $z$ and break;
5:         find $w \in Q$ corresp. to $\overleftarrow{s[t..t+z]}$ using $v$, prefix links, WA queries
6:         if $w$ is undefined then break;
7:         if $w$ does not have marked descendants then
8:             if $parent(w)$ is not marked or $|lab(w)| - w.mlen \le z$ then break;
9:     $lz[t-p] \leftarrow 1$;

The lengths of the Lempel-Ziv factors are accumulated in $z$. The above
observation implies the correctness. Line 5 is similar to the procedure described above.
Since $O(n)$ queries to the prefix links and $O(n)$ markings of vertices take $O(n)$
time, by standard arguments, one can show that the algorithm takes $O(n)$ time.

Searching of occurrences. Denote by $Z$ the set of all Lempel-Ziv factors of
lengths $[r/2..\tau^2)$ starting in $[p..p+b]$. Obviously $|Z| = O(d/r)$. Using $lz$, we build
in $O(d\log\sigma)$ time a compact trie $R$ of the strings $\{\overleftarrow{z} : z \in Z\}$. We add to each $v \in
R$ such that $\overleftarrow{z_v} = lab(v)$ for some $z_v \in Z$ the list of all starting positions of the Lempel-Ziv
factors $z_v$ in $[p..p+b]$. Obviously, $R$ occupies $O((d/r)\log n) = O(d\log\sigma)$ bits. We
construct for the strings $Z$ a succinct Aho-Corasick automaton of [1] occupying
$O((d/r)\log n) = O(d\log\sigma)$ bits. In [1] it is shown that the reporting states of the
automaton can be associated with vertices of $R$, so that we can scan $s[0..p+d-1]$
in $O(n)$ time and store the found positions of the first occurrences of the strings
$Z$ in $R$. Finally, by a DFS traversal of $R$, we obtain for each string of $Z$ the
position of its first occurrence in $s[0..p+d-1]$. To find earlier occurrences of other
Lempel-Ziv factors starting in $[p..p+b]$, we use the algorithms of Sections 2 and 4.


4 Long Factors 


4.1 Main Tools 


Let $k \in \mathbb{N}$. A set $D \subseteq [0..k)$ is called a difference cover of $[0..k)$ if for any
$x \in [0..k)$, there exist $y, z \in D$ such that $y - z \equiv x \pmod{k}$. Obviously $|D| \ge \sqrt{k}$.
Conversely, for any $k \in \mathbb{N}$, there is a difference cover of $[0..k)$ with $O(\sqrt{k})$
elements and it can be constructed in $O(k)$ time (see [6]).


Example 2. The set $D = \{1, 2, 4\}$ is a difference cover of $[0..5)$.

| $x$ | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| $y, z$ | 1,1 | 2,1 | 1,4 | 4,1 | 1,2 |



Lemma 6 (see [?]). Let $D$ be a difference cover of $[0..k)$. For any integers $i, j$,
there exists $d \in [0..k)$ such that $(i - d) \bmod k \in D$ and $(j - d) \bmod k \in D$.
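Both the definition and Lemma 6 can be checked by brute force on small instances. The Python sketch below does this; the names `is_difference_cover` and `common_shift` are hypothetical, and the linear scan over `d` is only for illustration.

```python
def is_difference_cover(D, k):
    """True iff every x in [0..k) equals (y - z) mod k for some y, z in D."""
    diffs = {(y - z) % k for y in D for z in D}
    return diffs == set(range(k))


def common_shift(D, k, i, j):
    """Smallest d in [0..k) with (i-d) mod k and (j-d) mod k both in D,
    or None if there is no such d (impossible for a difference cover)."""
    members = set(D)
    for d in range(k):
        if (i - d) % k in members and (j - d) % k in members:
            return d
    return None
```

For the set of Example 2, `common_shift({1, 2, 4}, 5, i, j)` finds a shift for every pair of integers `i`, `j`.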


An ordered tree is a tree whose leaves are totally ordered (e.g., a trie).

Lemma 7 (see [16]). In $O(k\log k)$ bits of space we can maintain an ordered
tree with at most $k$ vertices under the following operations:

1. insertion of a new leaf (possibly splitting an edge) in $O(\log k)$ time;

2. searching of the leftmost/rightmost descendant leaf of a vertex in $O(\log k)$ time.

Lemma 8 (see [?]). A linked list can be designed to support the following
operations: 1. insertion of a new element in $O(1)$ amortized time; 2. determining
whether $x$ precedes $y$ for given elements $x$ and $y$ in $O(1)$ time.


To support fast navigation in tries, we associate with each vertex $v$ a
dictionary mapping the first letters in the labels written on the outgoing edges of $v$ to
the corresponding children of $v$. So, whether a trie contains a string with a prefix
$w$ can be checked in $O(|w|\log\rho)$ time, where $\rho$ is the alphabet size. Notice that
a compact trie for a set of $k$ substrings of the string $s$ can be stored in $O(k\log n)$
bits using pointers for the edge labels. But the described searching time is too
slow for our purposes, so, using packed strings and fast string dictionaries, we
improve our tries with the operations provided in the following lemma.

Lemma 9. In $O(k\log n)$ bits of space we can maintain a compact trie for at
most $k$ substrings of $s$ under the following operations:

1. insertion of a string $w$ in $O(|w|/r + \log n)$ amortized time;

2. searching of a string $w$ in $O(|u|/r + \log n)$ time, where $u$ is the longest prefix of
$w$ present in the trie; we scan $w$ from left to right $r$ letters at a time and report
the vertices of the trie corresponding to the prefixes of lengths $r, 2r, \ldots, \lfloor |u|/r \rfloor r$,
and $|u|$ immediately after reading these prefixes.


Proof. Denote by S the set of all strings stored in T. For a substring t of the 
string s, denote by t' a string of length [|t|/rj such that for any i G [ 0 ..|t'|), t'[i] 
is equal to the packed string t[ri..r[i-\-l.)—l]. We maintain a special compact trie 
T' containing the set of strings {t': t G S}: the dictionaries associated with the 
vertices of T' are organized in such a way that the searching and insertion of 
a string w' both work in 0(|r(;'| -I- logfc) amortized time; such tries are called 
dynamic ternary trees (see [9] for a comprehensive list of references). For each 
V G T, we insert in T' a vertex corresponding to the string t' (if there is no such 
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Fig. 3. A compact trie T is on the left; the corresponding ternary tree T' is on the right. 
If r = 4, the searching of w = aaaaccccaaaac reports the vertices 6,6,8,8 corresponding 
to the prefixes of lengths r, 2r, 3r, and |u)|, respectively. 


vertex), where t = lah{v) (consider the vertices 3 on the left and 2 on the right of 
Fig. [3|). All vertices of T' are augmented with the pointers to the corresponding 
vertices of T (depicted as dashed lines in Fig. [3]). 

Let w be a string to be searched in T, and denote by u the longest prefix of w
present in T. Using the pointers of T', we can report the vertices corresponding
to the prefixes w[0..r−1], w[0..2r−1], ..., w[0..|u'|r−1] while traversing T'. Once u'
is found in T' in O(|u'| + log k) time, we start to traverse T reading the string
u[|u'|r..|u|−1] from the position corresponding to u[0..|u'|r−1]. This operation
requires additional O(r log σ) = O(log n) time. The insertion is analogous. □
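
The two-phase search in the proof can be illustrated with a minimal Python sketch. It is not the structure of Lemma 9: a plain set of strings stands in for the trie T, the membership test is a naive scan, and the names pack and reported_lengths are hypothetical. The sketch only shows which prefix lengths are reported: first the multiples of r found block by block (the role of T'), then the final length |u| found letter by letter in fewer than r steps.

```python
def pack(t, r):
    """Split t into floor(len(t)/r) blocks of r letters (the string t')."""
    return [t[i * r:(i + 1) * r] for i in range(len(t) // r)]


def reported_lengths(stored, w, r):
    """Prefix lengths of w reported by a block-wise search.

    `stored` is a plain set of strings standing in for the trie T
    (an assumption of this sketch). The result lists r, 2r, ...,
    floor(|u|/r)*r and finally |u|, where u is the longest prefix of w
    that is a prefix of some stored string.
    """
    def present(prefix):
        return any(s.startswith(prefix) for s in stored)

    lengths = []
    matched = 0
    # Phase 1: advance r letters at a time, as the traversal of T' does.
    for block in pack(w, r):
        if not present(w[:matched] + block):
            break
        matched += r
        lengths.append(matched)
    # Phase 2: finish letter by letter (fewer than r steps), as in T.
    u_len = matched
    while u_len < len(w) and present(w[:u_len + 1]):
        u_len += 1
    if not lengths or lengths[-1] != u_len:
        lengths.append(u_len)
    return lengths


example = reported_lengths({"aaaaccccaaaacb", "aaaab"}, "aaaaccccaaaac", 4)
```

For the string w of Fig. 3 and r = 4, the sketch reports the lengths 4, 8, 12, and 13, matching the lengths r, 2r, 3r, and |w| in the caption (the stored strings here are made up for the example).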

In the dynamic tree range reporting problem one has ordered trees T_1 and
T_2 and a set of pairs Z = {(x_1, x_2)}, where x_1 and x_2 are leaves of T_1 and T_2,
respectively (see Fig. II); the query asks, for given vertices v_1 ∈ T_1 and v_2 ∈ T_2,
to find a pair (x_1, x_2) ∈ Z such that x_1 and x_2 are descendants of v_1 and v_2,
respectively; the update inserts new pairs in Z or new vertices in T_1 and T_2. To
solve this problem, we apply the structure of [5] and Lemmas 7 and 8.

Lemma 10. The dynamic tree range reporting problem with |Z| ≤ k can be
solved in O(k log k) bits of space with updates and queries working in O(log k)
amortized time.

Proof. To prove this lemma, we need an additional tool. In the dynamic
orthogonal range reporting problem one has two linked lists X and Y, and a set
of pairs Z = {(x_i, y_i)}, where x_i ∈ X and y_i ∈ Y; the query asks to report, for
given elements x_1, x_2 ∈ X and y_1, y_2 ∈ Y, a pair (x, y) ∈ Z such that x lies
between x_1 and x_2 in X, and y lies between y_1 and y_2 in Y; the update inserts
new pairs in Z or new elements in X or Y.

Lemma 11 (see [5]). The dynamic orthogonal range reporting problem on at
most k pairs can be solved in O(k log k) bits of space with updates and queries
working in O(log k) amortized time.

We maintain the ordered tree structure of Lemma 7 on T_1 and T_2. The order
on the lists of leaves of T_1 and T_2 is maintained with the aid of enhanced linked
lists of Lemma 8. To process queries efficiently, we build the dynamic orthogonal
range reporting structure of Lemma 11 on these lists and the set of pairs Z.
These structures take overall O(k log k) bits of space. By Lemmas 8, 7, and 11, the
update of T_1, T_2, or Z requires O(log k) amortized time.

Suppose we process a query for vertices v_1 ∈ T_1 and v_2 ∈ T_2. We obtain the
leftmost and rightmost descendant leaves of v_1 and v_2 using Lemma 7. Then
we report a desired pair from Z (or decide that there are no such pairs) using
Lemma 11. By Lemmas 7 and 11, the query takes O(log k) amortized time. □
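
The reduction used in this proof can be sketched in Python under simplifying assumptions: the trees are static, leaf ranks are computed by one traversal instead of the structures of Lemmas 7 and 8, and a naive scan over Z replaces the range reporting structure of Lemma 11. The names leaf_intervals and tree_range are hypothetical. The point is only that a vertex corresponds to a contiguous interval of leaf ranks, so a tree range query becomes a two-dimensional range query.

```python
def leaf_intervals(children, root):
    """Map each vertex to the (first, last) ranks of its descendant leaves.

    `children` maps a vertex to the ordered list of its children;
    vertices absent from the map are leaves.
    """
    interval = {}
    rank = 0

    def visit(v):
        nonlocal rank
        kids = children.get(v, [])
        if not kids:
            interval[v] = (rank, rank)
            rank += 1
            return
        for c in kids:
            visit(c)
        interval[v] = (interval[kids[0]][0], interval[kids[-1]][1])

    visit(root)
    return interval


def tree_range(int1, int2, pairs, v1, v2):
    """Return a pair (x1, x2) with x1 below v1 and x2 below v2, or None."""
    lo1, hi1 = int1[v1]
    lo2, hi2 = int2[v2]
    # Naive scan; the paper uses an orthogonal range reporting structure here.
    for x1, x2 in pairs:
        if lo1 <= int1[x1][0] <= hi1 and lo2 <= int2[x2][0] <= hi2:
            return (x1, x2)
    return None


# Two small ordered trees and a set Z of leaf pairs (made up for the example).
int1 = leaf_intervals({"a": ["b", "c"], "b": ["d", "e"]}, "a")
int2 = leaf_intervals({"p": ["q", "r"], "r": ["s", "t"]}, "p")
Z = [("d", "q"), ("c", "t")]
```

With these trees, the query for the vertices b and r finds no pair, whereas the query for a and r returns the pair (c, t).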

4.2 Algorithm for Long Factors 

Data structures. At the beginning, using the algorithm of [6], our algorithm
constructs a difference cover D of [0..τ²) such that |D| = O(τ). Denote M = {i ∈
[0..n): i mod τ² ∈ D}. The set M is the basic component in our constructions.
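
As a small illustration of these definitions, the following Python sketch checks the difference cover property by brute force and lists the set M. It does not reproduce the construction of [6]; the function names are hypothetical, and the cover {0, 1, 3, 6} with τ = 3 is the one from Fig. 4.

```python
def is_difference_cover(D, m):
    """True iff every residue modulo m is a difference of two elements of D."""
    diffs = {(a - b) % m for a in D for b in D}
    return diffs == set(range(m))


def sampled_positions(D, tau, n):
    """The set M = {i in [0..n) : i mod tau^2 in D}, as a sorted list."""
    period = tau * tau
    return [i for i in range(n) if i % period in D]


D = {0, 1, 3, 6}
M = sampled_positions(D, 3, 20)
```

Here M consists of the positions 0, 1, 3, 6, 9, 10, 12, 15, 18, 19, that is, |D| positions in every block of τ² = 9 consecutive positions.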

Suppose the algorithm has reported the Lempel-Ziv factors z_1, z_2, ..., z_{k−1}
and already decided that |z_k| ≥ τ² applying the procedure of Section 3. Denote
p = |z_1 z_2 ··· z_{k−1}|. We use an integer variable z to compute the length of z_k,
and z is initially equal to τ². Let us first discuss the related data structures.

We use an auxiliary variable t such that p ≤ t < p+z at any time of the work;
initially t = p. Denote s_i = s[0..i]. Our main data structures are compact tries
S and T: S contains the reversed strings ←s_i and T contains the strings s[i+1..i+τ²] for
all i ∈ [0..t) ∩ M (we append letters s[0] to the right of s so that s[i+1..i+τ²]
is always defined). Both S and T are augmented with the structures supporting
the searching of Lemma 9 and the tree range queries of Lemma 10 on pairs of
leaves of S and T. Since s[0] is a sentinel letter, each ←s_i, for i ∈ [0..t) ∩ M, is
represented in S by a leaf. The set of pairs for our tree range reporting structure
contains the pairs of leaves corresponding to ←s_i in S and s[i+1..i+τ²] in T for
all i ∈ [0..t) ∩ M (see Fig. 4). Also, we add to S the WA structure of Lemma 2.

Let us consider vertices v ∈ S and v' ∈ T corresponding to strings t_v and t_{v'},
respectively. Denote by treeRng(v, v') the tree range query that returns either nil
or a suitable pair of descendant leaves of v and v'. We have treeRng(v, v') ≠ nil
iff there is i ∈ [0..t) ∩ M such that s[i−|t_v|+1..i] s[i+1..i+|t_{v'}|] = ←t_v t_{v'}.

Since |M| ≤ ⌈n/τ²⌉|D| = O(n/τ), it follows from the space bounds of the structures
above that S and T with all related structures occupy at most O((n/τ) log n) = O(εn) bits.

The algorithm. Suppose the factor z_k occurs in a position x ∈ [0..p); then, by
the property of difference covers, there is a d ∈ [0..τ²) such that x+|z_k|−d ∈ M
and p+|z_k|−d ∈ M. Based on this observation, our algorithm, for each
t ∈ M ∩ [p..p+z), finds the vertex v ∈ S corresponding to ←s[p..t] and the vertex
v' ∈ T corresponding to as long as possible prefix of s[t+1..t+τ²] such that
treeRng(v, v') ≠ nil, and with the aid of this bidirectional search, we further
increase z if it is possible.
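
The observation about the shift d can be checked by brute force. The Python sketch below is only an illustration: common_shift is a hypothetical name, the search over d is linear, and membership in M is tested modulo τ² without the upper bound n.

```python
def common_shift(D, tau, x, y):
    """Some d in [0..tau^2) with (x - d) and (y - d) both in D modulo tau^2.

    Such a d exists for all x and y whenever D is a difference cover
    of [0..tau^2); the function returns None otherwise.
    """
    period = tau * tau
    for d in range(period):
        if (x - d) % period in D and (y - d) % period in D:
            return d
    return None


D = {0, 1, 3, 6}          # difference cover of [0..9), as in Fig. 4
d = common_shift(D, 3, 5, 17)
```

For the positions 5 and 17 the sketch returns d = 2, since 3 and 15 both belong to M. In the algorithm the two positions are the right ends x+|z_k| and p+|z_k| of the two occurrences of the factor.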
$aahadcaahahadcaahadcaahaaa$

Fig. 4. τ = 3, D = {0, 1, 3, 6} is a difference cover of [0..τ²); positions in M are underlined.


 1: for (t ← min{i ≥ p: i ∈ M}; t < p + z; t ← min{i > t: i ∈ M}) do
 2:     x ← the length of the longest prefix of s[t+1..t+τ²] present in T
 3:     y ← the length of the longest prefix of ←s_t present in S
 4:     if y < t − p + 1 then go to line 13
 5:     v ← the vertex corresp. to the longest prefix of ←s_t present in S
 6:     v ← weiAnc(v, t − p + 1);
 7:     for j = t, t+r, t+2r, ..., t+⌊x/r⌋r, t+x and v' ∈ T corresp. to s[t+1..j] do
 8:         if j ≥ p + z then                  ▷ |s[p..j]| > |s[p..p+z−1]|
 9:             if treeRng(v, v') = nil then
10:                 j ← max{j': treeRng(v, u) ≠ nil for u ∈ T corresp. s[t+1..j']};
11:             z ← max{z, j − p + 1};
12:         if treeRng(v, v') = nil then break;
13:     insert s[t+1..t+τ²] in T, ←s_t in S; process the pair of the corresp. leaves

Some lines need further clarification. Here weiAnc(v, i) denotes the WA query
that returns either the ancestor of v with the minimal weight ≥ i or nil if there
is no such ancestor; we assume that any vertex is an ancestor of itself. Since
M has period τ², one can compute, for any t, min{i > t: i ∈ M} in O(1) time
using an array of length τ², for example. The operations on T in lines 2, 13 take,
by Lemma 9, O(τ²/r + log n) time. To perform the similar operations on S in
lines 3, 5, 13, we use other techniques (discussed below) working in the same
time. The loop in line 7 executes exactly the procedure described in Lemma 9.
To compute j in line 10, we perform the binary search on at most r ancestors of
the vertex v'; thus, we invoke treeRng O(log r) times in line 10.
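
One possible form of the array of length τ² mentioned above is sketched below in Python. This is an assumption about the layout, not the paper's implementation: the table stores, for each residue modulo τ², the distance to the next residue in D, and the names build_next_table and next_in_M are hypothetical. The bound i < n from the definition of M is ignored for brevity, and D is assumed to be non-empty.

```python
def build_next_table(D, tau):
    """nxt[res] = smallest k >= 1 such that (res + k) mod tau^2 is in D."""
    period = tau * tau
    nxt = []
    for res in range(period):
        k = 1
        while (res + k) % period not in D:
            k += 1
        nxt.append(k)
    return nxt


def next_in_M(nxt, tau, t):
    """min{i > t : i mod tau^2 in D}, computed with one table lookup."""
    return t + nxt[t % (tau * tau)]


nxt = build_next_table({0, 1, 3, 6}, 3)
```

The table is built once in time polynomial in τ, and every step of the main loop then advances t by a single lookup.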

Let us prove the correctness. Suppose we have τ² ≤ z < |z_k| in some iteration.
It suffices to show that the algorithm cannot terminate with this value of z. Let
z_k occur in a position x ∈ [0..p). By the property of difference covers, there is
a d ∈ [0..τ²) such that x+z−d ∈ M and p+z−d ∈ M. Thus, the string ←s[p..p+z−d]
is present in S when t = p+z−d and we find the corresponding vertex v in line 6.
Moreover, the string s[p+z−d+1..p+z] is present in T and we find the vertex
corresponding to this or a longer string in the loop in lines 7–12. Denote this
vertex by w; w is either v' or u in line 10. Obviously, treeRng(v, w) ≠ nil, so we
increase z in line 11.

Let us estimate the running time. The main loop performs O(|z_k|/τ)
iterations. The operations in lines 2, 3, 5, 13 require, as mentioned above,
O(τ²/r + log n) time (some of them will be discussed in the sequel). One WA
query and one modification of the tree range reporting structure take, by
Lemmas 2 and 10, O(log n) time. By Lemma 9, the traverse of T in line 7 requires
O(τ²/r + log n) time. For each fixed t, every time we perform a treeRng query in
line 9, except probably for the first and last queries, we increase z by r. Hence,
the algorithm executes at most O(|z_k|/τ + |z_k|/r) such queries in total. Finally,
in line 10 we invoke treeRng at most O(log r) times for every fixed t. Putting
everything together, we obtain
O((|z_k|/τ)(τ²/r + log n) + (|z_k|/r) log n + (|z_k|/τ) log r log n) =
O(|z_k| log σ + |z_k| log r) = O(|z_k|(log σ + log log n)) overall time.

One can find the position of an early occurrence of z_k from the pairs of leaves
reported by the treeRng queries. Now let us discuss how to insert and search
strings in S.

Operations on S. The operations on S are based on the fact that for any
i ∈ [τ²..n) ∩ M, i − τ² ∈ M. Let u and v be leaves of S corresponding to some ←s_j
and ←s_k. To compare ←s_j and ←s_k in O(1) time via u and v, we store all leaves of S in
a linked list K of Lemma 8 in the lexicographical order. To calculate lcp(←s_j, ←s_k)
in O(log n) time via u and v, we put all leaves of S in an augmented search tree
B. Finally, we augment S with the ordered tree structure of Lemma 7.

Denote s'_i = s[i−τ²+1..i]. We add to S a compact trie S' containing ←s'_i for all
i ∈ [0..t) ∩ M (we assume s[0] = s[−1] = ..., so S' is well-defined). The vertices of
S' are linked to the respective vertices of S. Let w be a leaf of S' corresponding
to a string ←s'_i. We add to w the set H_w = {(p_1, p_2) : ∃j ∈ [0..t) ∩ M and s'_j = s'_i},
where p_1 and p_2 are pointers to the leaves of S corresponding to ←s_{j−τ²} and
←s_j, respectively; H_w is stored in a search tree in the lexicographical order of
the strings ←s_{j−τ²} referred by p_1, so one can find, for any k ∈ [0..t+τ²) ∩ M,
the predecessor or successor of the string ←s_{k−τ²} in H_w in O(log n) time. It is
straightforward that all these structures occupy O((n/τ) log n) = O(εn) bits.

Suppose S contains ←s_i for all i ∈ [0..t) ∩ M and we insert ←s_t. We first search ←s'_t
in S'. Suppose S' does not contain ←s'_t. We insert ←s'_t in S' in O(τ²/r + log n) time,
by Lemma 9, then add to S the vertices corresponding to the new vertices of S'
and link them to each other. Using the structure of Lemma 3 on S, we find the
position of ←s_t in K in O(log n) time. All other structures are easily modified in
O(log n) time. Now suppose S' has a vertex w corresponding to ←s'_t. In O(log n)
time we find in H_w the pairs (p_1, p_2) and (p'_1, p'_2) such that p_1 points to the
predecessor ←s_{j−τ²} of ←s_{t−τ²} in H_w and p'_1 points to the successor ←s_{k−τ²}. So,
the leaf corresponding to ←s_t must be between ←s_j and ←s_k. Using B, we calculate
lcp(←s_j, ←s_t) = τ² + lcp(←s_{j−τ²}, ←s_{t−τ²}) and, similarly, lcp(←s_k, ←s_t) in O(log n) time
and then find the position where to insert the new leaf by WA queries on S. All
other structures are simply modified in O(log n) time. Thus, the insertion takes
O(τ²/r + log n) time. One can use a similar algorithm for the searching of ←s_t.
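
The identity for the longest common prefix of two reversed prefixes that end with the same block of τ² letters can be verified with a short Python sketch. The helper names are hypothetical, L plays the role of τ², and lcp is computed naively here instead of with the search tree B.

```python
def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def rev_prefix(s, i):
    """The reversal of s[0..i] (inclusive); empty if i < 0."""
    return s[:i + 1][::-1] if i >= 0 else ""


def lcp_via_shift(s, j, t, L):
    """lcp of the reversed prefixes ending at j and t, via positions j-L and t-L.

    Assumes that the blocks s[j-L+1..j] and s[t-L+1..t] are equal, so the
    two reversed prefixes share their first L letters.
    """
    assert s[j - L + 1:j + 1] == s[t - L + 1:t + 1]
    return L + lcp(rev_prefix(s, j - L), rev_prefix(s, t - L))


s = "$abcdabcdabcd"
value = lcp_via_shift(s, 8, 12, 4)
```

In this example the reversed prefixes ending at positions 8 and 12 share 8 letters, and the shifted computation gives 4 + 4 = 8 as well.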